[SPARK-32270][SQL] Use TextFileFormat in CSV's schema inference with a different encoding #29063
Conversation
Test build #125594 has started for PR 29063 at commit
retest this please
Test build #125658 has finished for PR 29063 at commit
retest this please
PS does this suggest a change to the way spark-xml reads things? Separate question, of course.
Ah, it does. We should change spark-xml to use the SQL source instead of RDD APIs during schema inference too, but I think that would be difficult: the record separator here is a newline for CSV and JSON, whereas spark-xml depends on a custom Hadoop input format ... In most cases it wouldn't be a big deal, so I guess it's fine not to change it at this moment unless a major issue is found.
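To illustrate the distinction, here is a rough sketch (the `lineSep` option on the text source and the paths are illustrative assumptions, not code from this PR): line-delimited formats can feed schema inference straight through the SQL text source, while tag-delimited XML cannot.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// CSV and JSON records end at a (configurable) line separator, so the
// SQL text source can hand schema inference one record per element.
val csvLines: Dataset[String] = spark.read
  .option("lineSep", "\n")     // newline-separated records
  .textFile("/tmp/people.csv") // hypothetical input path

// An XML record like <person>...</person> may span many lines and has no
// single-character separator, which is why spark-xml stays on its custom
// Hadoop input format.
csvLines.show(5, truncate = false)
```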
Test build #125683 has finished for PR 29063 at commit
+1, LGTM. Thank you, @HyukjinKwon. Yes, the SQL code path is better.
Merged to master.
Thanks @srowen and @dongjoon-hyun
…aframe read / write API

### What changes were proposed in this pull request?

This PR is a retry of #47328, which replaces the RDD with a Dataset to write the SparkR metadata; in addition, this PR removes `repartition(1)`. We don't actually need it when the input is a single row, as that creates only a single partition:

https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57

### Why are the changes needed?

In order to leverage the Catalyst optimizer and SQL engine. For example, we now leverage UTF-8 encoding instead of plain JDK ser/de for strings. We have made similar changes in the past, e.g., #29063, #15813, #17255 and SPARK-19918.

We also remove `repartition(1)` to avoid an unnecessary shuffle.

With `repartition(1)`:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
   +- LocalTableScan [_1#0]
```

Without `repartition(1)`:

```
== Physical Plan ==
LocalTableScan [_1#2]
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should verify the change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47341 from HyukjinKwon/SPARK-48883-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
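As a rough illustration of the point above (the payload and output path are made up, not from the PR), a one-row local Dataset is already planned as a single-partition `LocalTableScan`, so the write needs no `repartition(1)`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A one-row local relation becomes a LocalTableScan with one partition,
// so repartition(1) would only add an unnecessary Exchange.
val metadata = Seq("""{"version": "3.5.0"}""").toDS() // hypothetical payload

metadata.explain() // prints: LocalTableScan [value#0]
metadata.write.mode("overwrite").text("/tmp/sparkr-metadata")
```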
What changes were proposed in this pull request?
This PR proposes to use the text datasource in CSV's schema inference. This shares the same reasons as SPARK-18362, SPARK-19885 and SPARK-19918: we are currently using a Hadoop RDD when the encoding is different, which is unnecessary. This PR completes SPARK-18362 and addresses the comment at #15813 (comment).
We should also keep the code paths consistent with the existing CSV and JSON datasources, but CSV schema inference with an explicit encoding is the one path that still differs from the UTF-8 case.
There is another angle: this PR might also be a bug fix. Spark session configurations, such as Hadoop configurations, are not respected during CSV schema inference when the encoding is different, because they have to be set on the Spark context for that code path instead.
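For context, here is a minimal sketch of the user-facing call whose inference path this PR changes (the file path and encoding are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// With a non-UTF-8 encoding, the lines backing schema inference are now
// read through the text datasource instead of a raw Hadoop RDD, so the
// session configurations are respected just like in the UTF-8 path.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("encoding", "ISO-8859-1") // also accepted under the "charset" alias
  .csv("/tmp/people-latin1.csv")    // hypothetical path

df.printSchema()
```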
Why are the changes needed?
For consistency, potentially better performance, and to fix a potential, very corner-case bug.
Does this PR introduce any user-facing change?
Virtually no.
How was this patch tested?
Existing tests should cover this.